19. Clean (Intro)
Clean: Intro
Improving Quality and Tidiness
Cleaning means acting on the assessments we made to improve quality and tidiness.
Improving Quality
Improving quality doesn't mean changing the data to make it say something different; that's data fraud.
Consider the animals DataFrame, which has headers for name, body weight (in kilograms), and brain weight (in grams). The last five rows of this DataFrame are displayed below:

Examples of improving quality include:
- Correcting when inaccurate, like correcting the mouse's body weight to 0.023 kg instead of 230 kg
- Removing when irrelevant, like removing the row with "Apple" since an apple is a fruit and not an animal
- Replacing when missing, like filling in the missing value for brain weight for Brachiosaurus
- Combining, like concatenating the missing rows in the more_animals DataFrame displayed below
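
The four kinds of quality fixes listed above can be sketched in pandas roughly as follows. This is a minimal illustration, not the lesson's actual data: the two DataFrames are small hypothetical stand-ins, the column names `name`, `body_weight_kg`, and `brain_weight_g` are assumed, and every value other than the mouse's weights is a placeholder chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the animals and more_animals DataFrames.
# Column names and values are assumptions made for illustration only.
animals = pd.DataFrame({
    "name": ["Mouse", "Apple", "Brachiosaurus"],
    "body_weight_kg": [230.0, 0.1, 87000.0],
    "brain_weight_g": [0.4, np.nan, np.nan],
})
more_animals = pd.DataFrame({
    "name": ["Cat"],
    "body_weight_kg": [3.3],
    "brain_weight_g": [25.6],
})

# Work on a copy so the original data stays available for comparison.
animals_clean = animals.copy()

# Correct when inaccurate: the mouse weighs 0.023 kg, not 230 kg.
animals_clean.loc[animals_clean["name"] == "Mouse", "body_weight_kg"] = 0.023

# Remove when irrelevant: an apple is not an animal.
animals_clean = animals_clean[animals_clean["name"] != "Apple"]

# Replace when missing: fill in the Brachiosaurus brain weight.
# 154.5 is a placeholder; a real value should come from a trusted source.
animals_clean.loc[
    animals_clean["name"] == "Brachiosaurus", "brain_weight_g"
] = 154.5

# Combine: append the rows from more_animals and renumber the index.
animals_clean = pd.concat([animals_clean, more_animals], ignore_index=True)

print(animals_clean)
```

Note that none of these steps changes what the data says about a real animal; each one only removes an error, an irrelevant record, or a gap.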

Improving Tidiness
Improving tidiness means transforming the dataset so that each variable is a column, each observation is a row, and each type of observational unit is a table. pandas provides functions that help us do that. We'll dive deeper into those in Lesson 4 of this course.
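
As a small preview, here is one way such a transformation might look. The sketch assumes a hypothetical wide table in which the year columns `2019` and `2020` are really values of a single variable, and uses `DataFrame.melt` to give that variable its own column. The table, its column names, and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical untidy table: the column headers "2019" and "2020"
# are values of a "year" variable, not variables themselves.
wide = pd.DataFrame({
    "name": ["Cat", "Dog"],
    "2019": [3.1, 20.0],
    "2020": [3.3, 21.5],
})

# Melt the year columns so each row is one observation
# (one animal in one year) and each variable is one column.
tidy = wide.melt(id_vars="name", var_name="year", value_name="body_weight_kg")

# The melted headers are strings; convert them to integers.
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```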
Programmatic Data Cleaning Process
The programmatic data cleaning process:
- Define
- Code
- Test
Defining means writing down a data cleaning plan in which we turn our assessments into specific cleaning tasks. This plan also serves as an instruction list so that others (or we ourselves in the future) can look at our work and reproduce it.
Coding means translating these definitions to code and executing that code.
Testing means checking our dataset, often using code, to make sure our cleaning operations worked.
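
One pass through the Define, Code, and Test steps might look like the sketch below. It reuses the mouse body weight issue from earlier in this section; the DataFrame is a hypothetical two-row stand-in, and the column names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical stand-in for the assessed dataset.
animals = pd.DataFrame({
    "name": ["Mouse", "Cat"],
    "body_weight_kg": [230.0, 3.3],
})

# Define
# Correct the mouse's body weight, recorded as 230 kg, to 0.023 kg.

# Code
animals_clean = animals.copy()
animals_clean.loc[animals_clean["name"] == "Mouse", "body_weight_kg"] = 0.023

# Test
mouse_weight = animals_clean.loc[
    animals_clean["name"] == "Mouse", "body_weight_kg"
].iloc[0]
assert mouse_weight == 0.023, "Mouse body weight was not corrected"
```

Keeping the written definition next to the code, as a comment here or as a Markdown cell in a notebook, is what lets someone else follow the plan and reproduce the result.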